benchmarks: Initial DirectTrajOpt benchmark suite (Ipopt vs MadNLP)#75
Open
jack-champagne wants to merge 15 commits into
Open
benchmarks: Initial DirectTrajOpt benchmark suite (Ipopt vs MadNLP)#75jack-champagne wants to merge 15 commits into
jack-champagne wants to merge 15 commits into
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
…pt-initial" This reverts commit 83da076.
- Add docs/src/benchmarks.md mirroring CuQuantum.jl pattern: problem description, result tables, environment info, reproduction instructions - Wire benchmarks page into docs/make.jl - Extract shared problem constructors to benchmark/problem_utils.jl (make_bilinear_problem, make_scaled_problem) to eliminate duplication - Remove hardcoded package_version, pass runner consistently - Update README with docs link and regression detection example Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove fragile Base.loaded_modules_order lookup for MadNLPSolverExt; MadNLPOptions is exported directly from DirectTrajOpt - Change MadNLP print_level from 1 (TRACE) to 6 (ERROR) to match Ipopt's silent output and avoid polluting benchmark stdout - Derive state_dim/control_dim from problem_dims(prob) instead of hardcoding in micro-benchmark MicroBenchmarkResult - Set BENCHMARK_RUNNER=github-actions in CI workflow so results are distinguishable from local runs - Remove unused ForwardDiff and LinearAlgebra from benchmark/Project.toml Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Pin HarmoniqsBenchmarks source to fix/widen-dto-compat branch which has DirectTrajOpt compat "0.8, 0.9" (fixes Pkg.instantiate failure) - Run JuliaFormatter on benchmark files Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…merge The compat-widening branch has merged to HBJ main, so point the benchmark Project.toml at main. HBJ is not yet registered in General, so we still need the explicit URL+rev source. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
8d3ff5f to
ff9d6f6
Compare
The uncommented top-level `Base.loaded_modules_order` lookup and the `for seed = 0:99` wins-counting loop are scope creep from the original PR — the file is not picked up by `@run_package_tests` nor referenced anywhere, so the only visible effect of uncommenting was a future landmine: the top-level lookup will throw `BoundsError` if MadNLPSolverExt is not loaded, and the loop contains `exit(1)` which would kill the test runner outright. Restore the original commented-out state. If we want to revive this as a real test later, wrap it in `@testitem` with `@test err < 1e-3` instead of `exit(1)`.
`contains(ti.filename, "benchmark")` would silently exclude any future testitem whose filename happens to contain "benchmark" (e.g. `test/regressions/hessian_benchmark.jl`). Switch to a path-component match via `splitpath` so only files actually under a `benchmark/` directory are excluded.
Tracking `rev = "main"` makes benchmark CI results non-reproducible: a force-push or follow-up commit on HBJ silently changes the benchmark schema for every old commit's CI re-run. Pin to the v0.2.0 prep tip (5401542c — convergence schema + extractors), bump as needed when we adopt new HBJ helpers. Drop in favor of `[compat]` once HBJ registers in General.
`make_scaled_problem` accepted `seed::Int = 42` but the scaling sweep never passed one — so every cell used seed 42, which means the timing table was effectively measuring nine variations of the same random instance rather than a sweep across instances. Generate a deterministic per-cell seed (`1000 + 100*N + dim`). Both solvers still receive the same instance for that cell so the Ipopt-vs-MadNLP comparison stays apples-to-apples; only across cells does the underlying problem change. This is still single-shot per cell — a follow-up could sweep K seeds and report median if we want statistical robustness.
The bare `try … catch end` blocks for `pkg_version` and the inline `try … catch; ; "unknown" end` for the git SHA both hid real failures silently (and the latter had a syntactically dubious `;` continuation). Capture the exception with `@warn` so CI surfaces a "Failed to look up …" message if the lookup ever breaks, rather than letting downstream results carry "unknown" with no explanation. Also hoist the git rev-parse into a named local for readability.
First-call JIT compile (Ipopt/MadNLP extension load, KKT/AD codegen) dominates wall-time on small problems and biases the Ipopt-vs-MadNLP comparison by solver order — whichever solver runs first pays the MOI extension JIT that the second avoids. Add a tiny `max_iter = 2` warmup solve per solver before the timed section in both the macro-benchmark and the memory-scaling sweep. Each @testitem runs in its own Julia process, so the warmup needs to be inside the testitem (compilation cache is per-process). This makes the reported wall-times steady-state numbers rather than "cold start + a few iterations of solve work".
The numbers in the benchmark page were single-run local samples that
got captured into the docs build verbatim. Without a workflow wiring
real CI artifacts into the docs, presenting them as bolded headline
claims ("MadNLP is 33% faster", "40–45% less memory") would let them
become the canonical DirectTrajOpt figures people quote — which is
misleading because they shift with hardware, BLAS, MUMPS, Julia
version, and even solver order before the warmup fix.
- Add an upfront note that the tables are example output showing the
schema of each benchmark, not pinned reference measurements.
- Remove the "33% faster / 43% fewer / 40–45% less / 30–35% faster"
bolded comparison sentences.
- Reframe the scaling table around what it's actually good for: each
solver vs itself over time, not a single-shot solver comparison.
Real numbers should come from the CI workflow's JLD2 artifacts. Wiring
those into the docs build is a follow-up.
JuliaFormatter line-broke the `DirectTrajOpt.solve!(...; options=...)` calls in the warmup blocks. Apply the formatter pass so the Formatter CI check stays green.
jack-champagne
added a commit
that referenced
this pull request
May 20, 2026
The convergence suite under `benchmark/convergence/` uses `@testitem` blocks that depend on `HarmoniqsBenchmarks` — which isn't in the package's main test environment. `@run_package_tests` was picking those up and failing with `ArgumentError: Package HarmoniqsBenchmarks not found in current path` on Julia 1.10/1.11/1.12. Filter out anything whose path includes a `benchmark` component. Match the path component (via `splitpath`) rather than a substring so future test files like `foo_benchmark.jl` aren't accidentally excluded. Mirrors the same filter added in PR #75 (initial benchmark suite).
MadNLP's `print_level` is a LogLevels enum where 0 isn't a valid value (`ArgumentError: invalid value for Enum LogLevels: 0`). The timed benchmark calls in this file already use `print_level=6` (silent at ERROR level) — make the warmup solves match. Caught by the benchmark CI workflow.
3 tasks
Single-shot timing on random instances was noisy enough to be
misleading. The previous CI run on this PR produced one anomalous cell:
N=25, dim=8: Ipopt 0.011s / 22KB alloc ← almost certainly an
iter-0 termination on a
seed-degenerate initial
point
The seed for that cell was `1000 + 100*25 + 8 = 3508`. Other cells
trended roughly as expected with problem size, but cell-to-cell single
shots are noisy enough that the comparison numbers aren't trustworthy
(e.g. MadNLP at 25×16 took 29s but at 51×16 took 39s; MadNLP at 25×4
was slower than at 51×4).
Switch to median over `n_seeds = 3` independent random instances per
(N, dim) cell:
- Seed scheme: `1000 + 100*N + dim + 10_000*(k-1)` for k ∈ 1:3.
Per-cell distinct, per-sample distinct.
- Both solvers receive the *same* instance for each (cell, seed) so
per-seed Ipopt-vs-MadNLP comparisons stay fair.
- All 3×K BenchmarkResults per cell are saved to JLD2 — the printed
table shows medians but the raw distribution is preserved for
downstream analysis.
Adds Statistics to benchmark/Project.toml for `median`.
Runtime cost: ~3x the previous scaling-sweep section, which puts the
whole benchmark workflow at ~15 min wall time (previously ~9 min,
still well under the workflow's default 6h timeout).
Docs page (`docs/src/benchmarks.md`) updated to describe the
median-of-K scheme and note that per-seed results are persisted to
the JLD2 artifact.
4 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Re-open of #67 (was merged into wrong base, reverted, re-applied).
Summary
Initial benchmarking infrastructure for DirectTrajOpt.jl comparing Ipopt and MadNLP solvers.
Depends on: HarmoniqsBenchmarks.jl
What's included
benchmark/Project.toml— Benchmark environment with HarmoniqsBenchmarks, BenchmarkTools, TestItems, MadNLPbenchmark/benchmarks.jl— Three@testitemsuites:.github/workflows/benchmark.yml— CI workflowTest plan
🤖 Generated with Claude Code